Linguistic Motivation in Automatic Sentence Alignment of Parallel Corpora: the Case of Danish-Bulgarian and English-Bulgarian

Authors

  • Angel Genov
  • Georgi Iliev
Abstract

We report the results of a sentence-alignment experiment on Danish-Bulgarian and English-Bulgarian parallel texts, applying a method based in part on linguistic motivations as implemented in the TCA2 aligner. Since the presence of cognates has a bearing on the alignment score of candidate sentences, we attempt to bridge the gap between source and target languages by transliterating the Bulgarian text, originally written in Cyrillic. An improvement in F1-measure is achieved in both cases.
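To illustrate why transliteration can help cognate matching, here is a minimal Python sketch. It is not the TCA2 implementation and not the transliteration scheme used in the paper: the mapping table `BG_TO_LATIN`, the helper names `transliterate` and `dice`, and the choice of a character-bigram Dice coefficient as the similarity measure are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, simplified Bulgarian Cyrillic -> Latin table (an assumption,
# not the scheme used by the authors).
BG_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "sht", "ъ": "a", "ь": "y", "ю": "yu", "я": "ya",
}


def transliterate(text):
    """Map each Cyrillic letter to Latin; leave other characters unchanged."""
    out = []
    for ch in text:
        latin = BG_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)                  # punctuation, digits, Latin text
        elif ch.isupper():
            out.append(latin.capitalize())  # keep initial capitals
        else:
            out.append(latin)
    return "".join(out)


def dice(a, b):
    """Dice coefficient over character bigrams (one possible cognate score)."""
    def bigrams(word):
        word = word.lower()
        return Counter(word[i:i + 2] for i in range(len(word) - 1))

    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    overlap = sum((x & y).values())         # multiset intersection
    return 2 * overlap / total


bg_word = "парламент"
en_word = "parliament"
print(dice(bg_word, en_word))                 # raw Cyrillic vs. Latin
print(dice(transliterate(bg_word), en_word))  # after transliteration
```

Without transliteration the two strings share no characters, so any string-similarity score is zero; after transliteration the Bulgarian word becomes comparable to its English counterpart. An aligner that rewards cognates can then pick up this signal.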

Similar resources

Lexical token alignment: experiments, results and applications

Lexical alignment is one of the most challenging tasks in processing and exploiting parallel texts. There are numerous applications that may benefit from an accurate multilingual lexical alignment of bi- and multi-language corpora. We describe in this paper a hypothesis-testing approach to the problem of automatic extraction of translation equivalents from sentence-aligned and tagged parallel corp...

Bulgarian X-language Parallel Corpus

The paper presents the methodology and the outcome of the compilation and the processing of the Bulgarian X-language Parallel Corpus (Bul-X-Cor) which was integrated as part of the Bulgarian National Corpus (BulNC). We focus on building representative parallel corpora which include a diversity of domains and genres, reflect the relations between Bulgarian and other languages and are consistent ...

Linguistic Issues in Language Technology – LiLT

The paper describes the construction of a Bulgarian-English treebank aligned on the word and semantic level. We consider the manual word level alignment easier and more reliable than the manual alignment on syntactic and semantic level. Thus, after manual word level alignment we apply an automatic procedure for the construction of semantic level alignments. Our work presents the main steps of t...

Hierarchical Agglomerative Clustering of English-Bulgarian Parallel Corpora

Most multilingual parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the hierarchical agglomerative clustering (HAC) technique to cluster multilingual parallel text on web contents. A clustering algorithm taking constraints from parallel corpora potentially has several attractive features. Firstly...

Automatic Extraction of Translation Equivalents From Parallel Corpora

This paper presents a simple and effective method for extraction of translation equivalents from parallel corpora. Experiments were conducted on Orwell's "1984" parallel corpus with translations available in six CEE languages, all of them aligned to the English original. Six bilingual lexicons X-English (En) were extracted, where X stands for one of Czech (Cz), Bulgarian (Bg), Eston...

Publication date: 2011